Importing relevant libraries
import pandas as pd
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
sns.set_style('darkgrid')
df = pd.read_csv('Groceries_dataset.csv')
df.head()
|   | Member_number | Date | itemDescription |
|---|---|---|---|
| 0 | 1808 | 21-07-2015 | tropical fruit |
| 1 | 2552 | 05-01-2015 | whole milk |
| 2 | 2300 | 19-09-2015 | pip fruit |
| 3 | 1187 | 12-12-2015 | other vegetables |
| 4 | 3037 | 01-02-2015 | whole milk |
Basic info about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Member_number    38765 non-null  int64
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB
Do we have nulls?
df.isnull().sum()
Member_number      0
Date               0
itemDescription    0
dtype: int64
No, we don't
df['itemDescription'].unique().size
167
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')  # dates like 21-07-2015 are day-first
plt.figure(figsize=(10,7))
sns.countplot(x=df['Date'].dt.year)
plt.title('When were the purchases made?')
plt.xlabel('Year')
plt.show()
We see that the dataset only contains purchases made in 2014 and 2015. Now let's look at the distribution in more detail.
plt.figure(figsize=(10,7))
df3 = df.groupby('Date').count()['itemDescription'].reset_index()
plt.plot(df3['Date'],df3['itemDescription'])
plt.xlabel('Date')
plt.ylabel('Number of items bought')
plt.title('Number of items sold (each day)')
plt.show()
Let's reduce the noise by considering the total count of items bought each month.
plt.figure(figsize=(10,7))
df3 = df.copy()
df3['Date'] = df3['Date'].dt.to_period('M').dt.to_timestamp()  # truncate each date to the first of its month
df3 = df3.groupby('Date').count()['itemDescription'].reset_index()
plt.plot(df3['Date'],df3['itemDescription'])
plt.xlabel('Date')
plt.ylabel('Number of items bought')
plt.title('Number of items sold (each month)')
plt.show()
To get a better idea of whether there is some yearly trend, we will adjust the graph as follows:
df3 = df.copy()
df3['Year'] = df['Date'].apply(lambda x: x.year)
df3['Month'] = df['Date'].apply(lambda x: x.month)
df3.drop(['Member_number','Date'],axis=1,inplace=True)
df3 = df3.groupby(['Year','Month']).count().reset_index()
d_2014 = df3[df3['Year'] == 2014]
d_2015 = df3[df3['Year'] == 2015]
plt.figure(figsize=(10,7))
plt.plot(d_2014['Month'],d_2014['itemDescription'],label='2014')
plt.plot(d_2015['Month'],d_2015['itemDescription'],label='2015')
plt.title('Number of items sold (each month)')
plt.xlabel('Month')
plt.ylabel('item count')
plt.legend()
plt.show()
merged = d_2014.merge(right=d_2015, on='Month')
corr = merged[['itemDescription_x', 'itemDescription_y']].corr().iloc[0, 1]
print(f'Correlation between sales in 2014 and 2015: {corr}')
Correlation between sales in 2014 and 2015: 0.4654402963659504
A couple of notes can be made here: the monthly sales in 2014 and 2015 follow no obvious common pattern, and the correlation between them is only moderate ($\approx 0.47$), so there is no strong evidence of a repeating yearly trend.
Now let's look at the sales per each item
df1 = df.groupby('itemDescription').count().sort_values(by='Member_number',ascending=False).reset_index()
df1.rename(columns={'itemDescription': 'Item',
'Member_number': 'Number of sales'},inplace=True)
df1.drop(['Date'],axis=1,inplace=True)
df1
|   | Item | Number of sales |
|---|---|---|
| 0 | whole milk | 2502 |
| 1 | other vegetables | 1898 |
| 2 | rolls/buns | 1716 |
| 3 | soda | 1514 |
| 4 | yogurt | 1334 |
| ... | ... | ... |
| 162 | rubbing alcohol | 5 |
| 163 | bags | 4 |
| 164 | baby cosmetics | 3 |
| 165 | preservation products | 1 |
| 166 | kitchen utensil | 1 |
167 rows × 2 columns
Summary of the Number of sales
df1['Number of sales'].describe()
count     167.000000
mean      232.125749
std       363.442098
min         1.000000
25%        30.500000
50%        85.000000
75%       264.000000
max      2502.000000
Name: Number of sales, dtype: float64
Let's look at the histogram plotting the distribution of the Number of sales
plt.figure(figsize=(10,7))
df1['Number of sales'].hist(alpha=0.6)
plt.xlabel('sales count')
plt.ylabel('item count')
plt.title("How many times each item was sold?")
plt.show()
We see that half of the items were purchased at most $85$ times. However, there are some outliers. Let's have a look at what the outliers are.
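A common heuristic for flagging outliers in a skewed distribution like this one is the 1.5 × IQR rule; a quick sketch using the quartiles from the `describe()` output above:

```python
# 25th and 75th percentiles of 'Number of sales' (from describe() above)
q1, q3 = 30.5, 264.0
iqr = q3 - q1                  # interquartile range
threshold = q3 + 1.5 * iqr     # classic upper outlier fence
print(f'Items sold more than {threshold:.0f} times are outlier candidates')
```

By this fence, items such as whole milk (2502 sales) count as outliers, consistent with the histogram.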
df1 = df.groupby('itemDescription').count().sort_values(by='Member_number',ascending=False).head(10).reset_index()
df1.drop(['Date'],axis=1,inplace=True)
df1.rename(columns={'itemDescription': 'Item',
'Member_number' : 'Number of sales'},inplace=True)
fig = px.bar(df1,
x='Item',
y='Number of sales',
title= 'Most purchased items')
fig.show()
Now let's have a look at the customers that bought the most items.
df1 = df.groupby('Member_number').count().sort_values(by='itemDescription',ascending=False).head(10).reset_index()
df1.drop(['Date'],axis=1,inplace=True)
df1.rename(columns={'itemDescription': 'Item count',
'Member_number' : 'Customer ID'},inplace=True)
df1['Customer ID'] = df1['Customer ID'].astype(str)
fig = px.bar(df1,
x='Customer ID',
y='Item count',
title='Customers that purchased the most items')
fig.show()
As things stand, the data is organized in a way that we cannot determine how many items (and which items exactly) each customer bought PER VISIT to the store. For example, have a look at the following table:
df[df['Member_number'] == 1000].sort_values(by='Date').head(3)
|   | Member_number | Date | itemDescription |
|---|---|---|---|
| 13331 | 1000 | 2014-06-24 | whole milk |
| 29480 | 1000 | 2014-06-24 | pastry |
| 32851 | 1000 | 2014-06-24 | salty snack |
We see that the customer with ID 1000 bought 3 items on June 24, 2014. The problem is, though, that we don't know how many times he went to the store that day, and what he bought on each visit. For example, it is possible that he went to the store 3 times, each time buying a single item. It is also possible that he went to the store twice: the first time he bought the pastry and the salty snack, and the second time the milk. Or maybe he went to the store only once and bought all 3 items at once. Which one is it? We don't know. But to perform any meaningful association analysis, we need to know. Since no such information was provided, we will make the following assumption:
Assumption. On each day, any given customer went to the store ONLY ONCE.
With this assumption in mind, we can reorganize the dataset in the following way:
df1 = df.groupby(['Member_number', 'Date'])['itemDescription'].agg(list).reset_index()
df1.rename(columns={'itemDescription': 'Items bought'}, inplace=True)
df1.head()
|   | Member_number | Date | Items bought |
|---|---|---|---|
| 0 | 1000 | 2014-06-24 | [whole milk, pastry, salty snack] |
| 1 | 1000 | 2015-03-15 | [sausage, whole milk, semi-finished bread, yog... |
| 2 | 1000 | 2015-05-27 | [soda, pickled vegetables] |
| 3 | 1000 | 2015-07-24 | [canned beer, misc. beverages] |
| 4 | 1000 | 2015-11-25 | [sausage, hygiene articles] |
Items bought now represents the set of all items which were bought during a single visit to the store.
Let's see how many items customers purchase per each visit to the store.
df1['Basket size'] = df1['Items bought'].apply(len)
df1['Basket size'].describe()
count    14963.000000
mean         2.590724
std          1.117469
min          2.000000
25%          2.000000
50%          2.000000
75%          3.000000
max         11.000000
Name: Basket size, dtype: float64
plt.figure(figsize=(10,7))
df1['Basket size'].hist(alpha=0.6)
plt.xlabel('item count')
plt.ylabel('customer count')
plt.title("How many items are being purchased each visit?")
plt.show()
We see that most customers purchase 2 items per each visit to the store.
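Claims like "half of the customers buy 2 items or fewer" can be checked directly with `quantile`. A minimal sketch on a hypothetical basket-size series (the real check would run on `df1['Basket size']`):

```python
import pandas as pd

# Hypothetical basket sizes standing in for df1['Basket size']
sizes = pd.Series([2, 2, 2, 2, 2, 3, 3, 3, 4, 5])

# Median and 95th percentile of the basket sizes
print(sizes.quantile(0.5), sizes.quantile(0.95))
```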
Now we will use Association rule learning to check whether there are patterns in the customers' purchasing behavior. To generate a set of relevant rules, we will use the Apriori algorithm. Before proceeding, make sure that you are familiar with the following concepts: association rule, support, confidence, lift (all of these concepts are explained here, in the section "Useful Concepts").
We will only generate rules whose support exceeds $0.1\%$ and whose confidence exceeds $10\%$.
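Before running the algorithm, here is a toy illustration (hypothetical baskets, not from the dataset) of how support, confidence and lift are computed for a single rule:

```python
# Four hypothetical baskets
transactions = [
    {'milk', 'bread'},
    {'milk', 'bread', 'eggs'},
    {'bread'},
    {'milk', 'eggs'},
]
n = len(transactions)

def support(itemset):
    # fraction of baskets that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# Rule: {milk} -> {bread}
sup = support({'milk', 'bread'})   # P(milk and bread)
conf = sup / support({'milk'})     # P(bread | milk)
lift = conf / support({'bread'})   # confidence vs. baseline P(bread)
print(sup, conf, lift)
```

A lift above 1 would mean buying milk makes bread more likely; here it is slightly below 1, so in this toy data milk buyers are marginally less likely than average to buy bread.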
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
df1 = df.groupby(['Member_number', 'Date'])['itemDescription'].agg(list).reset_index()
df1.rename(columns={'itemDescription': 'Items bought'}, inplace=True)
all_items = df['itemDescription'].unique()
data = []
for transaction in df1['Items bought']:
row = []
for item in all_items:
if item in transaction:
row.append(1)
else:
row.append(0)
data.append(row)
df2 = pd.DataFrame(data, columns=all_items).astype(bool)  # bool dtype avoids the mlxtend dtype warning
df2 = df2.rename_axis('Transaction ID')
frequent_itemsets = apriori(df2, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift")
rules.sort_values('confidence', ascending = False, inplace = True)
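For reference, the manual encoding loop can be replaced by a short pandas pipeline that directly produces the bool dtype `apriori` prefers. A sketch with a few hypothetical rows standing in for the real `df`:

```python
import pandas as pd

# Hypothetical rows in the same shape as the groceries dataset
df = pd.DataFrame({
    'Member_number': [1000, 1000, 1000, 2000],
    'Date': ['2014-06-24'] * 3 + ['2014-06-25'],
    'itemDescription': ['whole milk', 'pastry', 'whole milk', 'soda'],
})

# One row per (customer, day) "transaction", one bool column per item
basket = (df.groupby(['Member_number', 'Date'])['itemDescription']
            .value_counts()
            .unstack(fill_value=0)
            .gt(0))
print(basket)
```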
rules = rules[rules['confidence'] > 0.1].copy()
rules.head()
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 716 | (sausage, yogurt) | (whole milk) | 0.005748 | 0.157923 | 0.001470 | 0.255814 | 1.619866 | 0.000563 | 1.131541 |
| 711 | (sausage, rolls/buns) | (whole milk) | 0.005347 | 0.157923 | 0.001136 | 0.212500 | 1.345594 | 0.000292 | 1.069304 |
| 723 | (sausage, soda) | (whole milk) | 0.005948 | 0.157923 | 0.001069 | 0.179775 | 1.138374 | 0.000130 | 1.026642 |
| 125 | (semi-finished bread) | (whole milk) | 0.009490 | 0.157923 | 0.001671 | 0.176056 | 1.114825 | 0.000172 | 1.022008 |
| 705 | (yogurt, rolls/buns) | (whole milk) | 0.007819 | 0.157923 | 0.001337 | 0.170940 | 1.082428 | 0.000102 | 1.015701 |
Let's see how many association rules we are dealing with:
rows = rules.shape[0]
print(f'Number of rules: {rows}')
Number of rules: 99
Summary statistics for support:
rules['support'].describe()
count    99.000000
mean      0.002737
std       0.002233
min       0.001002
25%       0.001337
50%       0.001804
75%       0.003342
max       0.013968
Name: support, dtype: float64
We see that the support for all rules in our dataset is very low (i.e., only a tiny proportion of transactions contain both the antecedent and the consequent items). This may be problematic: any associations found may not be statistically significant.
Let's see the rules with the highest lift
rules.sort_values(by='lift', ascending=False).head(10)[['antecedents',
                                                        'consequents',
                                                        'consequent support',
                                                        'lift']]
|   | antecedents | consequents | consequent support | lift |
|---|---|---|---|---|
| 717 | (yogurt, whole milk) | (sausage) | 0.060349 | 2.182917 |
| 718 | (sausage, whole milk) | (yogurt) | 0.085879 | 1.911760 |
| 716 | (sausage, yogurt) | (whole milk) | 0.157923 | 1.619866 |
| 20 | (flour) | (tropical fruit) | 0.067767 | 1.617141 |
| 560 | (processed cheese) | (root vegetables) | 0.069572 | 1.513019 |
| 492 | (soft cheese) | (yogurt) | 0.085879 | 1.474952 |
| 471 | (detergent) | (yogurt) | 0.085879 | 1.444261 |
| 489 | (chewing gum) | (yogurt) | 0.085879 | 1.358508 |
| 711 | (sausage, rolls/buns) | (whole milk) | 0.157923 | 1.345594 |
| 248 | (processed cheese) | (rolls/buns) | 0.110005 | 1.315734 |
We see that the rule with antecedent (yogurt, whole milk) and consequent (sausage) has the highest lift. That means that once we know a customer has bought yogurt and whole milk, it becomes about 2.2 times as likely that they will also buy sausage. But as we've pointed out, since the support is very low, we cannot really tell whether this is a fluke or a real association.
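These metrics are directly related: confidence is the rule's support divided by the antecedent's support, and lift is confidence divided by the consequent's support. A quick sanity check using the rounded values printed for rule 716 above:

```python
# Rule 716: (sausage, yogurt) -> (whole milk), values from the table above
support = 0.001470
antecedent_support = 0.005748
consequent_support = 0.157923

confidence = support / antecedent_support   # P(milk | sausage, yogurt)
lift = confidence / consequent_support      # vs. baseline P(milk)
print(round(confidence, 3), round(lift, 3))
```

The results match the tabulated confidence (0.2558) and lift (1.6199) up to rounding of the inputs.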
Similarly, we will check the rules with the lowest lift (i.e., rules where the items in the antecedent and the consequent are unlikely to be bought together).
rules.sort_values(by='lift', ascending=True).head(10)[['antecedents',
                                                       'consequents',
                                                       'consequent support',
                                                       'lift']]
|   | antecedents | consequents | consequent support | lift |
|---|---|---|---|---|
| 108 | (margarine) | (whole milk) | 0.157923 | 0.801379 |
| 121 | (hygiene articles) | (whole milk) | 0.157923 | 0.803109 |
| 55 | (rolls/buns) | (whole milk) | 0.157923 | 0.804028 |
| 100 | (hard cheese) | (whole milk) | 0.157923 | 0.805917 |
| 98 | (ice cream) | (whole milk) | 0.157923 | 0.808960 |
| 56 | (pot plants) | (whole milk) | 0.157923 | 0.811821 |
| 77 | (canned beer) | (whole milk) | 0.157923 | 0.811821 |
| 67 | (fruit/vegetable juice) | (whole milk) | 0.157923 | 0.821072 |
| 68 | (yogurt) | (whole milk) | 0.157923 | 0.822940 |
| 113 | (oil) | (whole milk) | 0.157923 | 0.823471 |
Let's visualize the relation between support, confidence and lift
sup = rules['support'].values
conf = rules['confidence'].values
lift = rules['lift'].values
plt.figure(figsize=(10,6))
sc = plt.scatter(sup,conf,c=lift)
plt.colorbar(sc,label='Lift')
plt.xlabel('support')
plt.ylabel('confidence')
plt.show()
As we see, most rules have both low confidence and low support. However, there is one rule we may want to single out: the rule with a confidence of about $26\%$:
rules[rules['confidence'] > 0.24]
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 716 | (sausage, yogurt) | (whole milk) | 0.005748 | 0.157923 | 0.00147 | 0.255814 | 1.619866 | 0.000563 | 1.131541 |
We see that the probability of a customer buying whole milk is roughly $16\%$. But given that the customer has bought yogurt and sausage, the probability of buying milk increases to roughly $26\%$ (which is also why the rule has a high lift). But again, the support is very low (only about $0.1\%$). So while this rule seems to be the most promising of the rules our dataset contains, given the very low support one should not draw any rash conclusions about the association between the two itemsets without further investigation. How could one investigate? One option is to find a way to directly (or indirectly) ask the customers who bought yogurt, sausage and whole milk why they buy these items together.
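One way to quantify how uncertain such a low support estimate is (this is an addition, not something the notebook does): treat support as a binomial proportion over the transactions and compute a confidence interval. A Wilson-interval sketch, assuming n = 14963 transactions from the basket-size summary above:

```python
import math

n = 14963        # number of transactions (from the Basket size summary)
p = 0.001470     # observed support of rule (sausage, yogurt) -> (whole milk)
z = 1.96         # 95% confidence level

# Wilson score interval for a binomial proportion
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
print(f'95% CI for support: [{center - half:.4%}, {center + half:.4%}]')
```

The interval is wide relative to the estimate (the rule is backed by only ~22 baskets), which makes the caution above concrete.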
Based on the graph above, one can also see some rules with relatively decent support (i.e., support exceeding $1\%$). Let's check what those rules are:
rules[rules['support'] > 0.01]
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 68 | (yogurt) | (whole milk) | 0.085879 | 0.157923 | 0.011161 | 0.129961 | 0.822940 | -0.002401 | 0.967861 |
| 55 | (rolls/buns) | (whole milk) | 0.110005 | 0.157923 | 0.013968 | 0.126974 | 0.804028 | -0.003404 | 0.964550 |
Not only is the confidence low for both of these rules, but the lift is actually less than 1. For example, the probability of a customer buying whole milk is about $16\%$; if we know that the customer bought yogurt, the probability of them buying whole milk drops to about $13\%$. The upshot: these rules are not of much interest to us.
Over the span of two years, half of the items were sold 85 times or fewer each. The most popular items (i.e., those that were bought more than 1k times) are:
Per visit to the store, half of the customers purchase 2 items (or fewer), and 95% of the customers purchase 5 items (or fewer).
Using the Apriori algorithm (AA), we've found that most rules have very low support (which implies that even if a rule shows a strong association, the association might not be statistically significant). Furthermore, most rules have low confidence and low lift (which signifies a weak association). The only rule which may represent a meaningful association is $$\text{(yogurt, sausage)} \implies \text{(whole milk)}$$ This rule stands out because it has the highest confidence of all rules ($\approx 26\%$) and one of the highest lifts ($\approx 1.6$). This means that once we know a customer has purchased yogurt and sausage, the probability of them also buying whole milk increases significantly. However, given that the support is low ($\approx 0.1\%$), one should be careful before drawing any conclusions about whether the association is significant.